E-Commerce Customer Purchase Analysis¶

We want to study an E-commerce dataset:

  • Load the data and inspect it to understand its structure. (10 points)
  • Complete each task by creating the required visualizations using Plotly. (10 points each)
  • Use clear titles and labels for your charts so that they are easy to understand. (20 points)
  • Use Markdown cells to provide brief explanations of your findings and insights. (10 points)

1. Create a vertical bar chart that shows the total sales (in dollars) for the top 10 products based on Description. Calculate total sales as Quantity * UnitPrice.¶

2. Create a horizontal bar chart showing total sales by country for the top 10 countries. Calculate total sales per country as the sum of Quantity * UnitPrice for each country.¶

3. Create a pie chart to display the distribution of quantities sold for the top 6 products by quantity.¶

4. Convert InvoiceDate to month and create a line chart showing total monthly sales. The x-axis should represent months, and the y-axis should represent total sales in dollars.¶

5. Create a box plot showing the distribution of UnitPrice for the top 6 products by total quantity sold.¶

6. Create a scatter plot showing the correlation between UnitPrice and Quantity for all products. Use a trend line to show any correlation between price and quantity.¶

Cristian Lorenzo - HW3¶

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

file = pd.read_csv('data.csv',encoding="ISO-8859-1")
df = file.copy()

1. Create a vertical bar chart that shows the total sales (in dollars) for the top 10 products based on Description. Calculate total sales as (Quantity * UnityPrice).¶

In [2]:
total_sales = df[['Description', 'Quantity', 'UnitPrice', 'CustomerID']].copy()
total_sales['Total Sales'] = total_sales['Quantity'] * total_sales['UnitPrice']

ten_products = (
    total_sales.groupby('Description')['Total Sales']
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .reset_index()
)

ten_products['Total Sales'] = ten_products['Total Sales'].round(2)

fig = px.bar(
    ten_products, 
    x='Description', 
    y='Total Sales', 
    title='Top 10 products by Total Sales', 
    labels={
        'Description': 'Product Description', 
        'TotalSales': 'Total Sales (in Dollars)'
    },
    text='Total Sales'
)

fig.update_layout(xaxis_tickangle=45)

fig.show()

The product with the highest total sales from the 'Top 10 products by total sales' was 'DOTCOM POSTAGE' with 206,245 and the lowest was 'CHILLI LIGHTS' with 53768.

2. Create a horizontal bar chart showing total sales by country for the top 10 countries. Calculate total sales per country as the sum of Quantity * UnitPrice for each country.¶

In [3]:
total_sales = df[['Description', 'Quantity', 'UnitPrice', 'Country']].copy()
total_sales['Total Sales'] = total_sales['Quantity'] * total_sales['UnitPrice']

ten_countries = (
    total_sales.groupby('Country')['Total Sales']
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .reset_index()
)

ten_countries['Total Sales'] = ten_countries['Total Sales'].round(2)

fig = px.bar(
    ten_countries, 
    x='Total Sales', 
    y='Country', 
    title='Top 10 Countries by Total Sales', 
    labels={
        'Total Sales': 'Total Sales (in Dollars)', 
        'Country': 'Country'
    },
    text='Total Sales', 
    orientation='h'
)

fig.show()

The country that had the most sales was United Kingdom. Which had significantly more than any other country as it had a difference of 7 million. From the 'Top 10 countries by Total Sales' Sweden had the least by 36595.

3. Create a pie chart to display the distribution of quantities sold for the top 6 products by quantity.¶

In [4]:
total_quantity = df[['Description', 'Quantity', 'CustomerID']].copy()

six_products = (
    total_quantity.groupby('Description')['Quantity']
    .sum()
    .sort_values(ascending=False)
    .head(6)
    .reset_index()
)

fig = px.pie(
    six_products, 
    values='Quantity', 
    names='Description', 
    title='Distribution of Quantities Sold for Top 6 Products', 

)

fig.show()

The top product sold by quantity was 'World War 2 Gliders Assstd Desgins' with it being 22% and the lowest was 'White Hanging Heart T-Light Holder' with it being at 14.4%

4. Convert InvoiceDate to month and create a line chart showing total monthly sales. The x-axis should represent months, and the y-axis should represent total sales in dollars.¶

In [5]:
monthly_sales = df.copy()

monthly_sales['InvoiceDate'] = pd.to_datetime(monthly_sales['InvoiceDate'])

monthly_sales['Month'] = monthly_sales['InvoiceDate'].dt.to_period('M')

monthly_sales = (
    monthly_sales.groupby('Month')
    .apply(lambda x: (x['Quantity'] * x['UnitPrice']).sum())
    .reset_index(name='Total Sales')
)

monthly_sales['Month'] = monthly_sales['Month'].astype(str)

fig = px.line(
    monthly_sales,
    x='Month',
    y='Total Sales',
    title='Total Monthly Sales',
    labels={'Month': 'Month', 'Total Sales': 'Total Sales (in Dollars)'},
    markers=True
)

# Display the chart
fig.show()

There was a peak on November 2011 with it being on 1.46 million. Then had a drop to 433k on December. I was a little caught by suprise on this because I would expect there to be more sales throughout the holidays.

5. Create a box plot showing the distribution of UnitPrice for the top 6 products by total quantity sold.¶

In [6]:
top_products = df.groupby('Description')['Quantity'].sum().nlargest(6).index

tp_df = df[df['Description'].isin(top_products)]

fig = px.box(
    tp_df,
    x='Description',
    y='UnitPrice',
    title='Distribution of UnitPrice for Top 6 Products by Quantity Sold',
    labels={'Description': 'Product Description', 'UnitPrice': 'Unit Price ($)'},
)

fig.show()

The 'White Hanging Heart T-Light Holder' had the greatest max at 6.77. The second highest was the 'Pack of 72 Retrospot Cake Cases' with a max of 5. The one that had the lowest distribution was 'World War 2 Gliders Asstd Designs'

6. Create a scatter plot showing the correlation between UnitPrice and Quantity for all products. Use a trend line to show any correlation between price and quantity.¶

In [7]:
fig = px.scatter(
    df, 
    x='UnitPrice', 
    y='Quantity', 
    title='Correlation between Unit Price and Quantity', 
    labels={
        'UnitPrice':'Unit Price (in dollars)',
        'Quantity':'Quantity Sold'
    },
    trendline='ols',
    
)

fig.show()

For the Unit Price and quantity, the scatter plot was a little difficult to do for me. The unit price 2.08 a quantity of 80k was sold